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Abstract 

Here  we  report  on  development  of  a  high  order  finite  element  code  for  the  solution  of  the 
shallow  water  equations  on  the  massively  parallel  computer  MP-1104.  We  have  compared 
the  parallel  code  to  the  one  available  on  the  Amdahl  serial  computer.  It  is  suggested  that 
one  uses  a  low  order  finite  element  to  reap  the  benefit  of  the  massive  number  of  processors 
available. 


1  Introduction 

The  shallow  water  equations  are  first  order  nonlinear  hyperbolic  partial  differential  equations 
having  many  applications  in  Meteorology  and  oceanography.  These  equations  can  be  used 
in  studies  of  tides  and  surface  water  run-off.  They  may  also  be  used  to  study  large-scale 
waves  in  the  atmosphere  and  ocean  if  terms  representing  the  effects  of  the  Earth's  rotation 
are  included.  See  review  article  by  Neta  (1992). 

Indeed,  it  had  become  customary,  in  developing  new  numerical  methods  for  weather 
prediction  or  oceanography,  to  study  first  the  simpler  nonlinear  shallow  water  equations, 
which  possess  the  same  mixture  of  slow  and  fast  waves  as  the  more  complex  baroclinic 
three-dimensional  primitive  equations.  One  of  the  issues  associated  with  the  numerical 
solution  of  the  shallow  water  equations  is  how  to  treat  the  nonlinear  advective  terms  (Cullen 
and  Morton,  1980,  Navon,  1987).  In  this  paper  the  two-stage  Galerkin  method  combined 
with  a  high  accuracy  compact  approximation  to  the  first  derivative  is  used.  The  method 
was  developed  by  Navon  (1987).  See  also  Navon  (1979a,  1979b,  1983).  Our  work  here  is  to 
discuss  porting  issues  of  finite  element  onto  a  massively  parallel  machine.  Section  2  discusses 
the  algorithm,  section  3  discusses  the  MasPar  hardware  and  software.  In  section  4  we  detail 
our  numerical  experiments  and  compare  the  results  to  the  code  running  on  the  Amdahl  serial 
computer. 

2  Finite  Element  Solution 

The  barotropic  nonlinear  shallow-water  equations  on  a  limited-area  domain  of  a  rotating 
earth  (using  the  /?-plane  assumption)  have  the  following  form: 

ut    +    uux  +  vuy  +  <px  —  fv  =  0 

vt    +    uvx  +  vvy  +  (fy  +  fu  =  0        0  <  x  <  L,  0  <  y  <  D,  t  >  0 

Here  u  and  v  are  the  velocity  components  in  the  x  and  y  directions  respectively,  /  is  the 
Coriolis  parameter  approximated  by  the  (3  plane  as 


where  /?,  /0,  are  constants  and  tp  =  gh  is  the  geopotential  height.  Periodic  boundary 
conditions  are  assumed  in  the  x  direction  and  rigid  boundary  conditions  (v  =  0)  are  imposed 
in  the  t/-direction.  The  domain  is  a  cylindrical  channel  simulating  a  latitude  belt  around 
the  earth  (see  e.g.  Hinsman,  1975).  The  finite  element  approximation  leads  to  systems  of 
ODES  which  can  be  finite  differenced  in  time  (see  e.g.  Douglas  and  Dupont,  1970).  In  the 
two  stage  Galerkin  (originally  proposed  by  Cullen,  1974),  we  let  any  of  the  4  derivatives  in 
the  nonlinear  terms  be  approximated  by  the  compact  Numerov  scheme,  i.e.  for 

du 

OX 

we  have 

—  [zi+2  +  16zt+i  +  362,  +  162,-1  +  2,_2]  = 

—  [-5u,_2  -  32u,_!  +  32u,+1  +  5ui+2] 

Similarly  for  zxv,  zyu  and  2yv.The  approximation  of  jr-  requires  an  interpolation  of  the  bound- 
ary values  Vo,  v^r+i 

v0  =  4vi  —  6v2  +  4v3  —  v4 

VN+1   =  4vN  -  6vN-y  +  4u/v_2  -  VN-3 


dv 
dy 
dv 
dy 


— 25vi  +  48t>2  —  36u3  +  16u4  —  3v5 
i  ==  \2h 

3uat_4  —  16u;v_3  +  36t>7V-2  —  48vyv-i  +  25uat 
n  I2h 


This  stage  will  require  a  solution  of  a  pentadiagonal  system.  For  the  second  stage,  we  let  w 
be  any  of  the  four  nonlinear  terms  and  we  solve  a  tridiagonal  system.  For 


w  =  vz 


we  h 


ave 


g(«>j-i  +  4wj  +  Wj+i)    =    ±(vj-iZj-i  +  VjZj-i  +  v,'_i*y+ 

vj+lzi  +  VjZj+1  +  VHlZJ+l  +  QVjZj) 

This  two  stage  approximation  yields  O  (hs)  approximation  to  the  derivatives  uXiuyivx  and 

Vy. 

Now  the  approximation  of  the  shallow  water  equations  becomes 

M(u^+1  -  uj)  +  At[(tf*„)J  +  (vZyu);  -  frf]  =  AtK21 

M(v]+1  -  t>?)  +  At[(vzyv)*  +  u;+1(2xr),  +  /-u?+1]  =  AtK31 


where 


M(^+1  ~  *D  "  l^K^?1  +  V>J) 


#21    =     2^"+   +^"") 


#31      =      ^3Bl+1+^3l) 


'u 


=  ///*« 


M«    =     /  /  V,  K  <M 


014 


31 


"   ?/£^ 


/rcr1  =  >  /   »ri:^v:-iM 


<9y 


«  «  ^SL^dA 

and  where  K  are  the  finite  element  shape  functions. 

u'  =  u"+1/2  =  |un  -  iu""1  +  0  (At)2) 

and  similarly  for  v*. 

Schuman  (1957)  filter  was  applied  every  12  time  steps  to  the  v  component  of  velocity  in 
order  to  recover  the  higher  accuracy  of  the  method. 

Since  the  two-stage  Galerkin  method  does  not  conserve  integral  invariants  (Cullen  [1979]) 
we  apply  an  aposteriori  technique  using  an  augmented  Lagrangian  nonlinearly  constrained 
optimization  approach  for  enforcing  the  conservation  of  integral  invariants  of  the  shallow 
water  equations  (see  Navon  and  deVilliers  (1983)  and  Navon  (1983)). 

3      System  Overview 

The  MasPar  family  of  massively  parallel  processing  systems  consists  of  arrays  of  IK  to  16K 
processing  elements  (PE),  a  scalar  control  unit  (ACU)  and  a  UNIX  subsystem.  Architec- 
turally, each  PE  is  a  custom  64-bit  RISC  processor  with  48  32-bit  registers  and  64  KB  of  data 
memory.  All  PEs  execute  instructions  which  are  broadcast  from  the  ACU  on  data  stored  in 
their  local  memory.  Although  there  is  only  a  single  instruction  stream,  the  processors  have 
a  number  of  autonomies,  including  the  ability  to  generate  independent  addresses  for  indirect 
loads  and  stores  to  memory. 


The  PEs  share  data  using  two  communication  mechanisms:  the  xnet  and  the  router. 
The  xnet  is  an  eight-way  nearest  neighbor  mesh  that  is  used  for  structured  communications 
such  as  stencil  operations  in  finite  difference  codes.  The  router  is  a  multi-stage  circuit- 
switched  network  for  global  or  random  communication  patterns.  I/O  to  and  from  the  PEs  is 
transferred  via  the  router  to  an  external  memory  buffer  called  I/O  RAM.  From  I/O  RAM, 
data  can  asynchronously  be  transferred  to  a  wide  variety  of  devices  such  as  disk  arrays, 
frame  buffers,  or  other  machines.  The  MasPar  Disk  Array  (MPDA)  provides  up  to  22 
GB  of  formatted  capacity  as  a  true  UNIX  file  system.  The  UNIX  subsystem  provides  the 
programming  and  run-time  environment  to  users. 

3.1      MasPar  Software 

The  MasPar  system  is  programmed  in  either  MPL,  a  parallel  extension  to  ANSI  C,  or  Mas- 
Par  Fortran,  an  implementation  of  Fortran  90.  In  MasPar  Fortran  (MPF)  parallel  operations 
are  expressed  with  the  Fortran  90  (F90)  array  extensions  which  treat  entire  arrays  as  manip- 
ulatable  objects,  rather  than  requiring  them  to  be  iterated  through  one  element  at  a  time. 
F90  has  also  added  a  significant  number  of  intrinsic  libraries;  operations  such  as  matrix 
multiplication  and  dot  product  are  part  of  the  language.  Since  Fortran  90  is  a  standard 
defined  by  the  ANSI/ISO  committees,  programs  are  architecture  independent  and  can  be 
transparently  moved  to  other  platforms. 

Fortranll  Fortran90 

doi  =  1 ,  256  a  =  b  +  c 

do  j  =  1,256 
a(i,j)  =  b{i,j)  +  c(ij) 

enddo 
enddo 

The  Fortran  90  code  can  be  run  on  any  computer  with  a  F90  compiler.  On  a  scalar  machine 
such  as  a  workstation,  the  arrays  will  be  added  one  element  at  a  time;  just  as  if  it  had  been 
written  in  Fortran  77.  On  a  vector  machine,  the  number  of  elements  added  at  a  time  is 
based  on  the  vector  length;  a  machine  with  a  vector  length  of  64  will  add  64  array  elements 
at  once.  The  MasPar  machine  acts  like  a  vector  machine  with  a  very  long  vector.  On  a  16K 
MasPar  machine,  16384  arrays  elements  are  added  simultaneously. 

MasPar  provides  key  routines  in  math,  signal,  image,  and  data  display  libraries.  The 
Math  Library  (MPML)  contains  a  number  of  high-level  linear  algebra  solvers,  including 
a  general  dense  solver  with  partial  pivoting,  a  Cholesky  solver,  a  conjugate  solver  with 
preconditioning,  and  an  out-of-core  solver.  MPML  also  includes  a  set  of  highly-tuned  linear 
algebra  building  blocks,  analogous  to  BLAS  on  vector  machines,  from  which  the  user  can 
develop  additional  solvers.  The  Data  Display  Library  provides  a  convenient  interface  to 
graphically  display  data  from  within  a  program  as  it  is  executing. 

The  MasPar  Programming  Environment  (MPPE)  is  an  integrated,  graphical  environment 
for  developing,  debugging,  and  tuning  applications.  MPPE  provides  a  rich  set  of  graphical 
tools  that  allow  the  user  to  interactively  control  and  visualize  a  program's  behavior.    The 


statement  level  profiler  allows  the  user  to  quickly  identify  the  compute-intensive  sections  of 
the  program  while  the  machine  visualizer  details  the  use  of  hardware  resources.  Each  of 
these  tools  are  continuously  available  without  having  to  recompile,  even  if  a  program  has 
been  compiled  with  optimizations. 

4     Program 

The  program  is  modular  and  is  complemented  with  easily  reachable  switches  controlling 
print  and  plot  options.  The  Input  to  the  program  consists  of  a  single  line  containing  the 
following  six  parameters: 

DT  -  the  time  step  in  seconds  (F5.2) 

NLIMIT  -  total  number  of  time  steps  (15) 

MF  -  number  of  time  steps  between  printing  solution  (15) 

NOUTU  -  to  print  (1)  or  not  to  print  (0)  the  u-component 

NOUTV  -  to  print  (1)  or  not  to  print  (0)  the  u-component 

NPRINT  -  to  print  (1)  or  not  to  print  (0)  the  global  nodal  numbers  of  each  triangular 
elements  and  the  indices  and  node  coordinates  of  the  nonzero  entries  of  the  global  matrix. 

The  main  program  initializes  all  variables  and  then  reads  the  only  data  card  of  the 
program.  It  then  proceeds  to  index  and  label  the  nodes  and  the  elements,  thus  setting  up 
the  integration  domain.  This  is  done  by  subroutine  NUMBER. 

Subroutine  CORRES  determine  the  nonzero  locations  in  the  global  matrix  and  stores 
them  in  array  LOCAT.  The  initial  fields  of  height  and  velocity  are  set  up  by  subroutine 
INCOND.  The  derivatives  of  the  shape  functions  (Vj)  are  calculated  in  AREA  A.  A  compact 
storage  scheme  for  the  banded  and  sparse  global  matrices  is  implemented  in  subroutine 
ASSEM.  The  method  is  based  on  the  fact  that  the  maximum  number  of  triangles  supporting 
any  node  is  six.  Three  different  types  of  element  matrices  (3  x  3)  will  be  required  for  assembly 
in  the  global  matrices. 

A  switch,  denoted  NSWITCH  is  set  for  selecting  between  the  different  types  of  element 
matrices.  After  setting  up  the  time  independent  global  matrices  the  program  proceeds  to 
the  main  do-loop  which  performs  the  time-integration  and  which  is  executed  once  for  every 
new  time-step. 

As  the  solution  of  the  nonlinear  constrained  optimization  problem  of  enforcing  conser- 
vation of  the  nonlinear  integral  invariants  requires  scaling  of  the  variables,  the  scaling  is 
performed  in  the  main  program  as  well  as  in  subroutine  INCOND. 

In  the  main  integration  loop  the  simulation  time  is  set  up  and  adjusted  and  then  the 
subroutines  ASSEM  and  MAMULT  set  up  and  assemble  the  global  matrices  which  then  are 
added  up  in  a  matrix  equation,  first  for  the  continuity  equation  and  in  a  similar  manner  for 
the  u  and  r-momentum  equations. 

Subroutine  SOLVER  then  is  called  to  solve  the  resulting  system  of  linear  equations  (of 
block  tridiagonal  form)  by  the  conjugate  gradient  square. 

The  new  field  values  for  the  geopotential  and  velocities,  $*■  ,u£-  ,t>£+1  respectively,  are 
used  immediately  as  obtained  in  solving  the  coupled  shallow-water  equations  system.  For 
the  u  and  v-momentum  equations,  the  new  two-stage  Numerov-Galerkin  scheme  is  imple- 


mented.  Separate  routines  are  set  up  for  the  x  and  y-derivatives  advection  terms,  DX  and 
DY  respectively.  Subroutine  DX  implements  the  two-stage  Numerov-Galerkin  algorithm  de- 
scribed previously  for  the  advective  terms  in  the  u  and  u-momentum  equations  involving  the 
x-derivative. 

In  the  first  stage  it  calculates  the  0(h8)  accurate  generalized-spline  approximation  to 
the  (du/dx)  first  derivative  by  calling  upon  subroutine  CYCPNT  which  solves  a  periodic 
pentadiagonal  system  of  linear  equations  generated  by  the  spline  approximation. 

In  the  second  stage  it  implements  the  second  part  of  the  Numerov-Galerkin  algorithm 
for  the  nonlinear  advective  term  u(du/dx)  and  solves  a  cyclic  tridiagonal  system  by  calling 
upon  subroutine  CYCTRD.  Subroutine  DY  implements  the  two-stage  Numerov-Galerkin 
algorithm  described  previously  for  the  advective  terms  in  the  u  and  v-momentum  equations 
involving  the  y  —  derivative.  In  its  first  stage  it  calculates  the  0(h6)  accurate  generalized- 
spline  approximation  to  the  (du/dy)  first  derivative  by  calling  upon  subroutine  PENTDG 
which  solves  the  usual  pentadiagonal  system  of  linear  equations  generated  by  the  generalized- 
spline  approximation. 

In  the  second  stage  subroutine  DY  implements  the  second  part  of  the  Numerov-Galerkin 
algorithm  for  the  nonlinear  advective  term  u(du/dy)  and  solves  the  Galerkin  product  by 
calling  upon  subroutine  NCTRD  to  solve  a  special  tridiagonal  system. 

The  boundary  conditions  are  implemented  by  subroutine  BOUND.  Periodically,  a  Schu- 
man  filtering  procedure  is  implemented  for  the  ^-component  of  velocity  only,  by  calling 
subroutine  SMOOTH.  The  integral  invariants  are  calculated  at  each  time-step  by  calling 
subroutine  LOOK.  If  the  variations  in  the  integral  invariants  exceed  the  allowable  limits 
^Ei^Hi  or  t>z,  the  Augmented-Lagrangian  nonlinear  constrained  optimization  procedure  is 
activated.  The  unconstrained  optimization  uses  the  conjugate-gradient  subroutine  E14DBF 
of  the  NAG(1982)  scientific  library.  Subroutine  E14DBF  calls  a  user-supplied  subroutine 
FUNCT  which  evaluates  the  function  value  and  its  gradient  vector  as  well  as  subroutine 
MONIT  whose  purpose  is  merely  to  print  out  different  minimization  parameters. 

After  a  predetermined  number  of  steps,  subroutine  OUT  is  called,  which  in  turn  calls 
upon  the  subroutines  LOOK  to  calculate  the  integral  invariants.  Practically  4-5  augmented- 
Lagrangian  minimization  cycles  were  determined  to  be  sufficient. 

We  ran  the  program  under  MPPE  and  the  following  table  shows  the  CPU  time  used  by 
some  of  the  routines.  All  others  require  less  than  5%  each.  Therefore  we  have  decided  to  par- 
allelize ASSEM,  MAMULT,  SOLVER  (switching  from  Gauss  Seidel  to  Conjugate  Gradient 
Square).  Other  subroutines  we  parallelized  are: 

CORRES,  INCOND,  LOOK,  MONIT,  NUMBER  and  AREAA. 

After  this,  the  most  time  consuming  routines  become  E14DBF  and  FUNCT.  These  are 
required  only  if  the  integral  constraints  are  not  conserved.  Therefore  if  the  mesh  is  fine, 
these  routines  will  not  be  called.  Our  numerical  experiments  confirmed  that  these  two 
routines  were  called  only  in  the  coarsest  grid  case. 

The  next  set  include:  DX,  DY,  CYCTRD,  CYCPNT,  NCTRD,  PENTDG,  TRIDG,  and 
SMOOTH.  We  have  decided  not  to  try  at  this  point  to  parallelize  these  or  BOUND.  We 
have  ran  this  program  on  the  MP- 11 04  (4096  processors)  on  a  variety  of  grid  sizes.    The 


Routines  CPU 

SOLVER  32% 

ASSEM  25% 

MAMULT  14% 

CORRES  5% 

BOUND  5% 

Table  1:  CPU  time  used  by  some  routines 

original  program  was  also  ran  on  the  Amdahl  5990/500  serial  computer.  All  computations 
were  performed  in  double  precision.  The  domain  is  a  rectangle  6000  km  by  4400  km.  The 
coarsest  mesh,  Ax  =  Ay  =  400km.  This  means  that  the  number  of  grid  points  in  the 
x-direction,  NC,  is  15  and  the  number  of  grid  points  in  the  y-direction,  NROW  is  11.  (At 
will  be  adjusted  for  stability.)  The  number  of  time  steps,  NLIMIT,  is  30. 

NC  NROW  Ax(fcm)  Ay{km)  A*(sec)  Amdahl  (sec)  MP-1104(sec) 

15  11  400  400  18.  1.14  14 

48  45  133|  133|  5.51  13.52  31.3 

63  62  93.75  70.97  4.22  24.8  44.3 

88  85  51.76  51.76  3.03  48.32  80 

128  125  46.87  46.87  2.10  -  164 

Table  2:  Total  CPU  time  for  several  grids 
The  initial  condition  for  the  height  field  is  given  by 

,/       x      rr                ,  9(D/2-y)  H2  .    2ttx 

M*,y)  =  //o  +  #itanh — +  __z_sm  — 

where 

H0  =  2000m,         Hi  =  -220m,         H2  =  133m, 


and 

/o  =  lO~4sec-\         j3=1.5x  10-11  sec^mT1. 

This  initial  condition  is  given  in  Grammelt  veldt  (1969)  and  tested  by  several  researchers 
(Cullen  and  Morton  (1980),  Gustafsson  (1971),  Navon  (1987)  etc.)  The  initial  velocity  fields 
were  derived  from  the  initial  height  field  via  the  geostrophic  relationships 

g  dh 

g  dh 

Table  2  gives  the  CPU  time  for  each  grid. 

If  we  compare  the  CPU  time  for  three  of  the  subroutines  we  parallelized  (to  avoid  the 
difficulty  that  some  parts  are  still  running  on  the  front  end)  we  find  that  in  MAMULT  and 
SOLVER  we  were  able  to  cut  the  CPU  time.  The  results  are  summarized  in  Table  3. 


Subroutine 

Problem  size 

Amdahl  (sec) 

MP-1104  (sec) 

ASSEM 

48  by  45 

3.02 

5.77 

63  by  62 

5.47 

8.56 

88  by  85 

10.49 

15.2 

128  by  125 

- 

34.4 

MAMULT 

48  by  45 

.42 

.44 

63  by  62 

.74 

.37 

88  by  85 

1.44 

.88 

128  by  125 

- 

1.53 

SOLVER 

48  by  45 

7.21 

5.97 

63  by  62 

13.14 

4.87 

88  by  85 

25.38 

10.6 

128  by  125 

- 

17.9 

Table  3:  CPU  time  before  and  after  parallelization 

The  code  was  ran  under  profiler  and  we  found  that  now  the  CPU  usage  (in  percent  of 
total  CPU)  is  as  given  in  table  4. 

It  is  clear  that  one  should  parallelize  DX,DY,PENTDG,TRIDG  and  LOOK.  The  first 
four  require  that  one  parallelizes  the  subroutines  NCTRD,CYCTRD  and  CYCPNT.  This  is 
not  done  since  the  tridiagonal  and  pentadiagonal  systems  to  be  solved  are  of  order  NC.  We 
feel  that  one  should  approach  this  problem  slightly  differently.  Instead  of  trying  to  parallelize 
this  code  which  is  of  high  order,  we  should  parallelize  a  low  order  finite  element  code  for  the 
shallow  water  equations.  The  accuracy  of  the  solution  will  be  obtained  by  using  an  even  finer 
mesh  than  46  km  (NC=128)  we  used  above.  It  will  be  interesting  to  compare  the  accuracy 
and  efficiency  of  the  two  codes  on  MP-1104  machine. 


Subroutine     15  by  11     44  by  45     88  by  85     128  by  125 


FUNCT 

36.8 

- 

- 

- 

DX 

3.2 

12.3 

17.0 

18.6 

DY 

3.2 

12.8 

16.6 

20.0 

ASSEM 

10.2 

17.9 

16.0 

14.3 

PENTDG 

2.5 

12.0 

13.7 

11.4 

MAMULT 

16.2 

13.7 

9.8 

6.9 

TRIDG 

1.2 

6.5 

6.9 

5.2 

LOOK 

9.1 

4.1 

4.4 

8.4 

NCTRD 

.7 

3.2 

3.3 

2.5 

CYCPNT 

.7 

3.9 

3.2 

2.4 

CYCTRD 

.8 

2.6 

2.1 

1.5 

SOLVER 

8.0 

4.0 

1.9 

1.1 

SET  STI 

1.0 

1.7 

1.4 

1.2 

BOUND 

1.8 

1.7 

1.0 

3.9 

VFEUDX 

1.8 

1.3 

.6 

.5 

rest 

2.8 

2.1 

2.1 

2.1 

Table  4:  CPU  time  by  subroutine  after  parallelization 

Conclusion 

We  have  developed  a  high  order  finite  element  code  to  solve  the  shallow  water  equations 
on  the  MasPar  massively  parallel  computer  MP-1104.  It  is  believed  that  a  low  order  finite 
element  code  will  be  more  efficient  on  the  MP-1104  computer. 
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SUBROUTINE  ASSEM  (COMA, STI ,NSWTCH,CODI , AREA, tnod, tlocat) 

INCLUDE  'PAR' 

include  ' inter_info' 

real  C0MA(7,N0DEB) ,STI(NNOD,NNOD,NELE) 

real  C0DI(NN0D,NELE) 

integer  itemp,  tl,  t2,  t3,  t4,  tO,  timel,  time2 

real  tgmat(7,NELE) 

integer  tlocat (6, NODE) ,  tnod(NNOD.NELE) ,  nswtch 

CMPF  ONDPU  STI, CODI, COMA 

cmpf  map  coma(memory, allbits) 

cmpf  map  codi (memory ,allbits) 

cmpf  map  tgmat (memory ,allbits) 

cmpf  map  tlocat (memory ,allbits) 

cmpf  map  tnod (memory, allbits) 

cmpf  map  sti (memory, memory , allbits) 

COMA  =0.0 
C 

C    DECIDE  WHICH  ELEMNT  MATRIX  MUST  BE  CALCULATED 

GOTO  (100,200,500,600),  NSWTCH 
C 
100   continue 

call  assem_sl(codi,  tgmat,  tlocat,  tnod) 

coma(:,:)  =  tgmat (:, :N0DEB) 

RETURN 
C 
200     continue 

write(6,*)  '  error  for  nsvitch  -  2' 
RETURN 
C 
500   continue 

call  assem_s2(area,  tgmat,  tlocat,  tnod) 
coma(:,:)  ■  tgmat (:, :N0DEB) 
RETURN 
C 
600    continue 

call  assem_s3(sti,  tgmat,  tlocat,  tnod) 
coma(:,:)  =  tgmat (:, :N0DEB) 
return 

END 
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subroutine  assem_sl(codi,  gmat,  locat,  nod) 

include  'PAR' 

real,  intent (in)   ::  codi(NN0D,  NELE) 

real,  intent(out)  ::  gmat(7,NELE) 

integer,  intent(in)  ::  locat (6, NODE) ,  nod (NNOD, NELE) 

real  tcodi (NNOD, NELE) 
cmpf  map  codi (memory, allbits) 
cmpf  map  gmat (memory , allbits) 
cmpf  map  locat (memory .allbits) 
cmpf  map  nod(memory, allbits) 
cmpf  map  tcodi (memory , allbits) 

integer  irow(NELE),  icol(NELE),  i,  k,  j,  1 

gmat  =  0 . 
tcodi  =  codi/6. 

do  100  k  *  1,  NNOD 
irow  =  nod(k, : ) 

do  150  j  *  1,  NNOD 
icol  =  nod(j , :) 
if(  k  .eq.  j)  then 


cmpf  collisions 


cmpf  collisions 


gmat(7,irow)=gmat  (7,  irow) +t codi  (j  , :) 
goto  150 
endif 
do  200  1  *  1,  6 

where(locat (l,irow)  .eq.  icol) 


gmat (1, irow)  =  gmat (1, irow)  +  tcodi (j,:) 
end  where 
200  continue 

150  continue 

100      continue 

return 
end 
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subroutine  assem_s2(area,  gmat,  locat,  nod) 
include  'PAR' 

real  area,  tareal2,  tarea6 

integer        locat (6, NODE) ,  nod(NNOD,NELE) 

real         gmat(7,NELE) 
cmpf  map  gmat (memory ,allbits) 
cmpf  map  locat (memory, allbits) 
cmpf  map  nod (memory, allbits) 

integer  irow(NELE),  icol(NELE),  i,  k,  j 

gmat  =  0. 

tareal2  =  area/12. 

tarea6  =  tareal2  *  2. 


do  100  k  =  1,  NNOD 
irow  =  nod(k, : ) 

do  150  j  *  1,  NNOD 
icol  =  nod(j , :) 

if(  k  .eq.  j)  then 


cmpf  collisions 


gmat (7, irow)=gmat(7,irow)+tarea6 
goto  150 
endif 
do  200  1  ■  1,  6 

vhere(locat (l.irow)  .eq.  icol) 


cmpf  collisions 


gmat (1, irow)  =  gmat (1, irow)  +  tareal2 
end  where 
200  continue 

150  continue 

100      continue 

return 
end 
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subroutine  assem_s3(sti,  gmat ,  locat,  nod) 

include  'PAR' 

real,  intent(out)  ::  gmat (7,NELE) 

real,  intent(in)  ::  sti(NNOD,NNOD,NELE) 

integer,  intent (in)  ::  locat (6, NODE) ,  nod(NNOD,NELE) 

cmpf  map  gmat (memory ,allbits) 

cmpf  map  locat (memory ,allbits) 

cmpf  map  nod (memory ,allbits) 

cmpf  map  sti(memory .memory ,allbits) 

integer  irow(NELE),  icol(NELE),  i,  k,  j,  1 

gmat  =  0 . 

do  100  k  =  1,  NNOD 
irow  =  nod(k, :) 

do  150  j  =  1,  NNOD 
icol  -   nod(j , :) 
if(  k  .eq.  j)  then 


cmpf  collisions 


gmat(7,irow)=gmat  (7,  irow)+sti(k,j  ,  :) 
goto  150 
endif 
do  200  1  =  1,  6 

where(locat (l,irow)  .eq.  icol) 


cmpf  collisions 


gmat(l,irow)  =  gmat(l,irow)  +  sti(k,j,:) 
end  where 
200  continue 

150  continue 

100      continue 

return 
end 
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SUBROUTINE  MAMULT  (COMA, VECTOR, RIGHT, locat) 
INCLUDE  'PAR' 

real,  intent (in)  ::  C0MA(7,N0DEB) ,  vector (:) 
real,  intent(out)  ::  RIGHT(:) 
integer,  intent (in)  ::  locat (6, NODE) 
integer  nloc(NODE) 
integer  kr 

cmpf  map  coma (memory, allbits) 
cmpf  map  locat (memory , allbits) 

right  =  0. 

RIGHT(:NODE)  =  coma(7, :node)*vector ( mode) 

DO  80  KR=1,6 

nloc ( : ) =locat (kr , : ) 
where (nloc( : ) .ne.O) 
k  RIGHT(:node)  =  RIGHT(:node)  +  C0MA(KR, mode) *VECT0R (nloc ( : )) 

80     CONTINUE 

RETURN 
end 

Conjugate  Gradient  Square  (CGS)  method  to  solve  non-symmetric 
positive  definite  metrix.   Ax  -  b 

coma  -  input  matrix  A 
right  ■  b 
xsolv  =  x 

subroutine  my.solver (coma , right , xsolv , eps , it ermx , locat) 
include  'PAR' 
include  'mamult.if 

real  coma(7,N0DEB) ,  eps 

real  right (NODEB) ,  xsolv (NODEB) 

real,  dimension(NODE)  ::  r,  rbar,  p,  pi,  q,  u,  mv 

real*8,  dimension(NODE)  ::  brbar,  bpl 

real  beta,  convl,  conv2,  restl,  rest2 

real  delO,  dell,  alpha,  residual(lOO) 

real*8  bdelO,  bdell 

integer  locat (6, NODE) 

integer  itermx,  i,  j 

common/debug/ntime 
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cmpf  map  coma(memory ,  allbits) 
cmpf  map  locat (memory , allbits) 
r  =  right  (mode) 
convl  =  dotproduct(r,r) 
call  mamult (coma,  xsolv,  mv,  locat) 
r  *  r  -  mv 
rbar  ■  r 
p  =  r 
u  -  r 

itermx  =  100 
pi  *  0. 
eps  =  1.5e-6 
do  10  i  =  1,  itermx 

call  mamult (coma,  p,  pi,  locat) 

delO  =  dotproduct (rbar, pi) 

dell  =  dotproduct (rbar, r) 

alpha  =  dell/delO 

q  =  u  -  alpha*pl 

u  =  u  +  q 

xsolv(:node)  =  xsolv(:node)  +  alpha*u 

call  mamult (coma,  u,  pi,  locat) 

delO  =  dell 

r  *  r  -  alpha*pl 

conv2  =  dotproduct (r,r) 

residual (i)  ■  sqrt(conv2/convl) 

if(  residual (i)  .It.  eps)  then 
return 

endif 

dell  =  dotproduct (rbar, r) 

beta  =  dell/delO 

u  ■  r  +  beta*q 

p  =  u  +  beta  *  (q  +  beta*p) 
10      continue 

PRINT  2001 
2001    FORMAT  (1X,'N0  CONVERGENCE') 
do  20  i  =  1,  itermx 

write(6,  *)  i,  '  residual  is  ',  residual(i) 
20      continue 
stop 
END 
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